Copy, please, these files and directories to your personal directory:

cp -r /data/shared/AGE2020/Exercises/E03-qPCR ~/AGE2020/Exercises

And switch the R working directory to the current exercise: setwd("~/AGE2020/Exercises/E03-qPCR")


Introduction

Quantitative real-time PCR monitors the amplification of a targeted DNA molecule during the PCR (i.e., in real time), not at its end, as in conventional PCR.

In real time reverse transcription quantitave PCR (RT-qPCR), we measure the amplification of target reverse transcribed DNA molecules. This amplification depends on initial RNA concentration and approximates the gene expression. We select a fluorescence signal threshold S_T and measure crossing points C_P (also called threshold cycle C_T) for samples. To correct for initial RNA concentration and quality, reference genes are used (dashed curves). These genes are considered to be equally expressed in every condition (purple and orange curves), and so serve as the baseline. First, we normalize each sample to reference gene by subtracting the C_P^r from C_P^t. We obtain the \Delta C_t values which are used to calculate the relative change of gene expression between samples, fold-change: FC = 2^{\Delta C_{P,cnt} - \Delta C_{P,trt}}. We usually report FC in log2 scale: LFC = \Delta C_{P,cnt} - \Delta C_{P,trt} = log_2(FC)

In real time reverse transcription quantitave PCR (RT-qPCR), we measure the amplification of target reverse transcribed DNA molecules. This amplification depends on initial RNA concentration and approximates the gene expression. We select a fluorescence signal threshold \(S_T\) and measure crossing points \(C_P\) (also called threshold cycle \(C_T\)) for samples. To correct for initial RNA concentration and quality, reference genes are used (dashed curves). These genes are considered to be equally expressed in every condition (purple and orange curves), and so serve as the baseline. First, we normalize each sample to reference gene by subtracting the \(C_P^r\) from \(C_P^t\). We obtain the \(\Delta C_t\) values which are used to calculate the relative change of gene expression between samples, fold-change: \(FC = 2^{\Delta C_{P,cnt} - \Delta C_{P,trt}}\). We usually report \(FC\) in log2 scale: \(LFC = \Delta C_{P,cnt} - \Delta C_{P,trt} = log_2(FC)\)

This will be an introductory exercise in which you will refresh your R skills and implement (mainly visualization) functions, which you will use later on different types of data.

Our experiment

We will work with data from pancreatic tumour experiment, where an influence of spirulina algae was investigated. In this experiment there are two sample groups:

  • control: samples from tumour.
  • spirulina: samples from tumour treated by extract from spirulina.

Each sample group has three biological replicates (grown on different Petri dish). In addition, each biological replicate is cultivated for 50% and 90% confluence, and for each confluence there are two technical replicates. Uhhh…

But we can expand it to:

- control group
  |
  |-- biological replicate #1 (Petri dish #1)
  |   |-- confluence 50
  |   |   |-- technical replicate #1 -> sample C1_C50_R1
  |   |   |-- technical replicate #2 -> sample C2_C50_R2
  |   |-- confluence 90
  |       |-- technical replicate #1 -> sample C1_C90_R1
  |       |-- technical replicate #2 -> sample C1_C90_R2
  |-- biological replicate #2 (Petri dish #2)
      |-- confluence 50
      |   |-- technical replicate #1 -> sample C2_C50_R1
      |   |-- technical replicate #2 -> sample C2_C50_R2
      |-- confluence 90
          |-- ...

- spirulina group
  |
  |-- biological replicate #1 (Petri dish #1)
  |   |-- confluence 50
  |   |   |-- technical replicate #1 -> sample S1_C50_R1
  |   |   |-- technical replicate #2 -> sample S1_C50_R2
  |   |-- confluence 90
  |       |-- technical replicate #1 -> sample S1_C90_R1
  |       |-- technical replicate #2 -> sample S1_C90_R2
  |-- biological replicate #2 (Petri dish #2)
      |-- confluence 50
      |   |-- technical replicate #1 -> sample S2_C50_R1
      |   |-- technical replicate #2 -> sample S2_C50_R2
      |-- confluence 90
          |-- ...

If you want, we can draw it on the board :)

Also, there are two groups of measured genes:

  1. Target genes - interesting genes found by microarray exploratory analysis.
  2. Housekeeping genes - used as reference genes.

Libraries

library(dplyr)
library(stringr)
library(tidyr)
library(glue)
library(magrittr)

This will be your personal library in which you implement the tasks. Now it contains only function skeletons. These skeletons are not obligatory, but your functions should have output similar to what you will see in this tutorial.

Note that your functions will be universal, serving not only for qPCR data. This is mainly true for exploratory analysis functions (hiearchical clustering, PCA, etc.), where input (expression matrix) is same also for microarrays and RNA-seq data.

source("../age_library.R")

Config

I would recommend you to always define constants at the beginning of a script. It is more transparent, and if you, for example, change some path, you haven’t to replace it in whole source code. Constants are, by convention, named in UPPERCASE.

CP_DATA <- "qPCR.Rds"

Data preprocessing

\(C_P\) matrix

We have already preprocessed a \(C_P\) matrix for you. But if you really want to start from the floor, you can create the \(C_P\) matrix from original files (here and here).

In \(C_P\) matrix, columns are samples, rows are genes, and values are \(C_P\) values. This format is generally used for gene expression data and you will see it also in case of microarrays and RNA-Seq.

cp <- readRDS(CP_DATA) %>%
  as.data.frame()
genes <- rownames(cp)
samples <- colnames(cp)

cp[1:5, 1:5]

CP values >= 40 are considered out of qPCR limit:

cp[cp > 39.9] <- NA

Some genes have so many NA values:

lq_genes <- rowSums(!is.na(cp))
lq_genes
##     CD24     CD31   CXCL12    CXCR4    CXCR7     FLT1    VEGFA   VEGRF2  ACTB H4 GAPDH H5 RPS9 H11  TBP H26 
##       24       19        2       10        0        0       24        2       24       24       24       24

Let’s remove them:

lq_genes <- names(lq_genes[lq_genes <= 2])
genes <- setdiff(genes, lq_genes)
cp <- cp[genes, ]

Now we split genes to target and reference ones. Latter have H on the end of their names:

target_genes <- genes[str_detect(genes, ".+ H", negate = TRUE)]
ref_genes <- setdiff(genes, target_genes)

gene_groups <- if_else(str_detect(genes, " H\\d+$"), "reference", "target")
(genes_df <- data.frame(
  gene = as.character(genes),
  gene_group = factor(gene_groups, levels = c("reference", "target")),
  
  row.names = genes,
  stringsAsFactors = FALSE)
)

For plotting purposes, we will replace NAs by 40:

cp_plot <- replace(cp, is.na(cp), 40)

Phenotypical data (sample sheet)

You usually obtain a sample sheet, but in this case we are going to parse the column names of \(C_P\) matrix. Everything needed is contained there:

  • K and SP: sample groups
  • Number behind sample group: Petri dish
  • Number behind slash (“/”): confluence
  • (2RT): second technical replicate
(sample_names <- colnames(cp))
##  [1] "K1/50"        "K1/50(2RT)"   "K1/90"        "K1/90 (2RT)"  "K2/50"        "K2/50(2RT)"   "K2/90"        "K2/90 (2RT)"  "K3/50"        "K3/50(2RT)"   "K3/90"       
## [12] "K3/90 (2RT)"  "SP1/50"       "SP1/50 (2RT)" "SP1/90"       "SP1/90 (2RT)" "SP2/50"       "SP2/50 (2RT)" "SP2/90"       "SP2/90 (2RT)" "SP3/50"       "SP3/50(2RT)" 
## [23] "SP3/90"       "SP3/90 (2RT)"
pheno_data <- data.frame(
  sample_id = sample_names,
  sample_group_code = str_match(sample_names, "^SP|K")[, 1] %>% recode_factor("K" = "c", "SP" = "sp"),
  petri_number = str_match(sample_names, "^[A-Z]{1,2}(\\d)")[, 2] %>% as.factor(),
  confluence = str_match(sample_names, "\\/(\\d{2})")[, 2] %>% as.factor(),
  replicate = str_match(sample_names, "\\((\\d)RT\\)")[, 2] %>% replace_na("1") %>% as.factor()
) %>%
  dplyr::mutate(
    sample_name_rep = glue("{sample_group_code}{petri_number}_{confluence}_r{replicate}") %>% as.character(),
    sample_name = glue("{sample_group_code}{petri_number}_{confluence}") %>% as.character(),
    sample_group = recode_factor(sample_group_code, "c" = "control", "sp" = "spirulina")
  ) %>%
  dplyr::select(sample_name_rep, sample_name, everything()) %>%
  set_rownames(.$sample_name_rep)

colnames(cp) <- rownames(pheno_data)
colnames(cp_plot) <- rownames(pheno_data)
samples <- colnames(cp)

head(pheno_data)

Long data

We also create long data from cp, pheno_data and genes_df.

cp_long <- as.data.frame(cp) %>%
  tibble::rownames_to_column("gene") %>%
  tidyr::pivot_longer(-gene, names_to = "sample_name_rep", values_to = "cp") %>%
  dplyr::left_join(pheno_data, by = "sample_name_rep") %>%
  dplyr::left_join(genes_df, by = "gene") %>%
  dplyr::mutate(
    cp_plot = if_else(is.na(cp), 40, cp)
  ) %>%
  dplyr::select(sample_name_rep, sample_name, gene, cp, cp_plot, gene_group, everything())

head(cp_long)

Exploratory analysis on raw data

We are using exploratory analysis to have an unbiased first look at data. It is also a crucial step to assess the biological quality control, e.g. whether samples from different groups create separate clusters - if not, samples might be swapped.

Hiearchical clustering (TASK 1)

TASK 1: Implement your own function which takes \(C_P\) matrix and returns a dendrogram with coloring by a chosen variable.

color_variables <- c("sample_group", "confluence", "petri_number")

for (variable in color_variables) {
  plot_hc(
    cp_plot, color_by = pheno_data[, variable],
    method_distance = "euclidean", method_clustering = "complete",
    color_by_lab = variable
  )
}

PCA (TASK 2)

TASK 2: Implement a function for PCA visualization with point coloring by a chosen variable.

  • Base R:
plot_pca(cp_plot, color_by = pheno_data$sample_group, color_by_lab = "sample_group")

  • You can use ggplot2, but be prepared to use data in long format:
plot_pca_ggplot2(cp_plot, pheno_data, color_by = "sample_group", shape_by = "confluence")$plot

plot_pca_ggplot2(cp_plot, pheno_data, color_by = "sample_group", shape_by = "confluence", label_by = "sample_name_rep")$plot

plot_pca_ggplot2(cp_plot, pheno_data, color_by = "sample_group", shape_by = "petri_number")$plot

  • You can also use GGally::ggpairs() to plot a grid of PCA plots (PC1 vs. PC2, PC1 vs. PC3, PC2 vs. PC3, etc):
plot_pca_ggpairs(cp_plot, pheno_data, n_components = 5, color_by = "sample_group", shape_by = "confluence")$plot

Heatmap (TASK 3)

TASK 3: Implement your own function for heatmap.

You can use a nice package heatmaply which produces interactive HTML heatmaps. Or pheatmap for a better alternative to heatmap.2().

plot_heatmap(as.matrix(cp_plot), color_by = pheno_data$sample_group, color_legend_lab = "Sample group", main = "Treatment (Cp values)")

plot_pheatmap(cp_plot, pheno_data, genes_df, column_color_by = color_variables, row_color_by = "gene_group", main = "qPCR")

plot_heatmaply(cp_plot, pheno_data, genes_df, column_color_by = color_variables, row_color_by = "gene_group", main = "qPCR", key.title = "CP")

Can you see any outlying samples (replicates)? If yes, should they be removed?

Normalization to reference genes

We will use housekeeping genes as reference genes. Housekeeping genes are genes which have similar expression profile regardless the cell type or state. That is why they can be used for sample normalization (as an internal control).

Selecting the reference genes

Correlation

Expression of housekeeping genes is/should be correlated.

main_title <- "Raw data correlation"

t(cp) %>%
  as.data.frame() %>%
  GGally::ggpairs(progress = FALSE) +
  theme_bw()

You can also use base R:

pairs(t(cp))

Let’s look more closely at correlation coefficients:

(gene_cor <- cor(t(cp), use = "pairwise.complete.obs"))
##                 CD24        CD31       CXCR4       VEGFA     ACTB H4    GAPDH H5   RPS9 H11     TBP H26
## CD24      1.00000000  0.03970592 -0.28809768  0.73649418  0.73547823  0.73141005 0.76319246  0.72419041
## CD31      0.03970592  1.00000000 -0.73606175  0.03747391  0.03172092  0.17532289 0.10746231  0.08438151
## CXCR4    -0.28809768 -0.73606175  1.00000000 -0.20149851 -0.18053811 -0.02185175 0.08141219 -0.25461239
## VEGFA     0.73649418  0.03747391 -0.20149851  1.00000000  0.77468336  0.82725327 0.83395485  0.77534719
## ACTB H4   0.73547823  0.03172092 -0.18053811  0.77468336  1.00000000  0.92233124 0.89743784  0.93865311
## GAPDH H5  0.73141005  0.17532289 -0.02185175  0.82725327  0.92233124  1.00000000 0.96305770  0.93496580
## RPS9 H11  0.76319246  0.10746231  0.08141219  0.83395485  0.89743784  0.96305770 1.00000000  0.90135469
## TBP H26   0.72419041  0.08438151 -0.25461239  0.77534719  0.93865311  0.93496580 0.90135469  1.00000000
ggcorrplot::ggcorrplot(gene_cor, method = "circle") +
  ggtitle(main_title)

You can also use lattice:

lattice::levelplot(gene_cor, main = main_title)

Or heatmaply:

plot_heatmaply(gene_cor, feature_data = genes_df, row_color_by = "gene_group", main = main_title)

Housekeeping genes should cluster together. We can try both Euclidean and Pearson distance (1 - correlation coefficient) for hiearchical clustering.

plot_hc(gene_cor, color_by = gene_groups, color_by_lab = "Gene groups", method_distance = "euclidean", main = "Hierarchical clustering (Euclidean distance)")

plot_hc(gene_cor, color_by = gene_groups, color_by_lab = "Gene groups", method_distance = "pearson", main = "Hierarchical clustering (Pearson distance)")

PCA of correlation coefficients is also helpful:

plot_pca_ggplot2(gene_cor, genes_df, color_by = "gene_group")$plot

plot_pca_ggpairs(gene_cor, genes_df, color_by = "gene_group")$plot

geNorm M-values (TASK 4)

TASK 4: Implement a function which computes the M-value.

All reference genes are OK:

for (g in ref_genes)
  compute_m(g, cp_plot[ref_genes, ]) %>% glue("{g}: ", .) %>% cat(sep = "\n")
## ACTB H4: 0.587524037202414
## GAPDH H5: 0.675998845667576
## RPS9 H11: 0.73511830732382
## TBP H26: 0.545368225741302

Compare with target genes - they would not be good reference genes:

for (g in target_genes)
  compute_m(g, cp_plot[target_genes, ]) %>% glue("{g}: ", .) %>% cat(sep = "\n")
## CD24: 1.62202887577501
## CD31: 2.50323654673845
## CXCR4: 2.18054789182113
## VEGFA: 1.59255501294409

Subtracting the reference genes: \(\Delta C_P\)

For each sample we calculate a geometric mean of reference genes:

norm <- apply(cp[ref_genes, ], 2, function(x) {
  log(x) %>% mean() %>% exp()
})

Let’s look how these means behave:

gene_cor_norm <- rbind(cp, norm = norm) %>%
  t() %>%
  cor(use = "pairwise.complete.obs")

genes_df_norm <- genes_df %>%
  dplyr::mutate(gene_group = factor(gene_group, levels = c("reference", "target", "norm"))) %>%
  rbind(norm = c("norm", "norm"))

ggcorrplot::ggcorrplot(gene_cor_norm, method = "circle") +
  ggtitle(main_title)

plot_hc(gene_cor_norm, color_by = genes_df_norm$gene_group, color_by_lab = "Gene groups", method_distance = "euclidean", main = "Hierarchical clustering (Euclidean distance)")

plot_hc(gene_cor_norm, color_by = genes_df_norm$gene_group, color_by_lab = "Gene groups", method_distance = "pearson", main = "Hierarchical clustering (Pearson distance)")

plot_pca_ggplot2(gene_cor_norm, genes_df_norm, color_by = "gene_group")$plot

plot_pca_ggpairs(gene_cor_norm, genes_df_norm, color_by = "gene_group")$plot

We get \(\Delta C_P\) by subtracting the reference gene means:

(ref_gene_cp_matrix <- matrix(norm, ncol = ncol(cp), nrow = nrow(cp), byrow = TRUE))
##          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]    [,9]   [,10]    [,11]   [,12]    [,13]    [,14]    [,15]    [,16]   [,17]    [,18]    [,19]
## [1,] 23.69656 23.90002 24.05644 24.08679 23.30099 23.48331 22.74242 22.87819 23.0607 23.2196 23.11612 23.0268 25.53741 25.50754 25.97521 25.79664 25.7239 25.71187 25.56222
## [2,] 23.69656 23.90002 24.05644 24.08679 23.30099 23.48331 22.74242 22.87819 23.0607 23.2196 23.11612 23.0268 25.53741 25.50754 25.97521 25.79664 25.7239 25.71187 25.56222
## [3,] 23.69656 23.90002 24.05644 24.08679 23.30099 23.48331 22.74242 22.87819 23.0607 23.2196 23.11612 23.0268 25.53741 25.50754 25.97521 25.79664 25.7239 25.71187 25.56222
## [4,] 23.69656 23.90002 24.05644 24.08679 23.30099 23.48331 22.74242 22.87819 23.0607 23.2196 23.11612 23.0268 25.53741 25.50754 25.97521 25.79664 25.7239 25.71187 25.56222
## [5,] 23.69656 23.90002 24.05644 24.08679 23.30099 23.48331 22.74242 22.87819 23.0607 23.2196 23.11612 23.0268 25.53741 25.50754 25.97521 25.79664 25.7239 25.71187 25.56222
## [6,] 23.69656 23.90002 24.05644 24.08679 23.30099 23.48331 22.74242 22.87819 23.0607 23.2196 23.11612 23.0268 25.53741 25.50754 25.97521 25.79664 25.7239 25.71187 25.56222
## [7,] 23.69656 23.90002 24.05644 24.08679 23.30099 23.48331 22.74242 22.87819 23.0607 23.2196 23.11612 23.0268 25.53741 25.50754 25.97521 25.79664 25.7239 25.71187 25.56222
## [8,] 23.69656 23.90002 24.05644 24.08679 23.30099 23.48331 22.74242 22.87819 23.0607 23.2196 23.11612 23.0268 25.53741 25.50754 25.97521 25.79664 25.7239 25.71187 25.56222
##         [,20]    [,21]    [,22]    [,23]    [,24]
## [1,] 25.77045 23.89842 24.23634 24.82614 24.98597
## [2,] 25.77045 23.89842 24.23634 24.82614 24.98597
## [3,] 25.77045 23.89842 24.23634 24.82614 24.98597
## [4,] 25.77045 23.89842 24.23634 24.82614 24.98597
## [5,] 25.77045 23.89842 24.23634 24.82614 24.98597
## [6,] 25.77045 23.89842 24.23634 24.82614 24.98597
## [7,] 25.77045 23.89842 24.23634 24.82614 24.98597
## [8,] 25.77045 23.89842 24.23634 24.82614 24.98597
(delta_cp <- cp - ref_gene_cp_matrix)

Now we average technical replicates:

samples <- unique(pheno_data$sample_name)

delta_cp_avg <- matrix(NA, nrow = nrow(delta_cp), ncol = length(samples), dimnames = list(rownames(delta_cp), samples))
delta_cp_avg[1:5, 1:5]
##         c1_50 c1_90 c2_50 c2_90 c3_50
## CD24       NA    NA    NA    NA    NA
## CD31       NA    NA    NA    NA    NA
## CXCR4      NA    NA    NA    NA    NA
## VEGFA      NA    NA    NA    NA    NA
## ACTB H4    NA    NA    NA    NA    NA
(tech_rep <- split(colnames(delta_cp), pheno_data$sample_name))
## $c1_50
## [1] "c1_50_r1" "c1_50_r2"
## 
## $c1_90
## [1] "c1_90_r1" "c1_90_r2"
## 
## $c2_50
## [1] "c2_50_r1" "c2_50_r2"
## 
## $c2_90
## [1] "c2_90_r1" "c2_90_r2"
## 
## $c3_50
## [1] "c3_50_r1" "c3_50_r2"
## 
## $c3_90
## [1] "c3_90_r1" "c3_90_r2"
## 
## $sp1_50
## [1] "sp1_50_r1" "sp1_50_r2"
## 
## $sp1_90
## [1] "sp1_90_r1" "sp1_90_r2"
## 
## $sp2_50
## [1] "sp2_50_r1" "sp2_50_r2"
## 
## $sp2_90
## [1] "sp2_90_r1" "sp2_90_r2"
## 
## $sp3_50
## [1] "sp3_50_r1" "sp3_50_r2"
## 
## $sp3_90
## [1] "sp3_90_r1" "sp3_90_r2"
for (s in samples)
  delta_cp_avg[, s] <- rowMeans(delta_cp[, tech_rep[[s]], drop = FALSE], na.rm = TRUE)

# For control.
(rowMeans(delta_cp[, c("c1_50_r1", "c1_50_r2")], na.rm = TRUE) == delta_cp_avg[, "c1_50"]) %>%
  all() %>%
  stopifnot()

Final \(\Delta C_P\):

delta_cp_avg <- delta_cp_avg[target_genes, ]
delta_cp_avg_plot <- replace(delta_cp_avg, is.na(delta_cp_avg), ceiling(max(delta_cp_avg, na.rm = TRUE) + 1))

# Remove replicates.
pheno_data_avg <- dplyr::filter(pheno_data, replicate == 1) %>%
  dplyr::select(-sample_name_rep, -sample_id, -replicate) %>%
  dplyr::select(sample_name, everything()) %>%
  set_rownames(.$sample_name)

cp_long <- dplyr::filter(cp_long, replicate == 1) %>%
  dplyr::select(-sample_name_rep, -sample_id, -replicate) %>%
  dplyr::select(sample_name, everything())

# Add delta_cp_avg
cp_long <- as.data.frame(delta_cp_avg) %>%
  tibble::rownames_to_column("gene") %>%
  tidyr::pivot_longer(-gene, names_to = "sample_name", values_to = "delta_cp") %>%
  dplyr::right_join(cp_long, by = c("sample_name", "gene"))

# Add delta_cp_avg_plot
cp_long <- as.data.frame(delta_cp_avg_plot) %>%
  tibble::rownames_to_column("gene") %>%
  tidyr::pivot_longer(-gene, names_to = "sample_name", values_to = "delta_cp_plot") %>%
  dplyr::right_join(cp_long, by = c("sample_name", "gene"))

cp_long_target <- dplyr::filter(cp_long, gene_group == "target")

Exploratory analysis on summarised and normalised data

You can check for outliers, as these could mess up a differential expression analysis.

plot_pca_ggplot2(delta_cp_avg_plot, pheno_data_avg, color_by = "sample_group", shape_by = "confluence")$plot

plot_pca_ggpairs(delta_cp_avg_plot, pheno_data_avg, n_components = 4, color_by = "sample_group", shape_by = "confluence")$plot

for (variable in color_variables) {
  plot_hc(
    delta_cp_avg_plot,
    color_by = pheno_data_avg[, variable],
    method_distance = "euclidean",
    method_clustering = "complete",
    color_by_lab = variable
  )
}

plot_pheatmap(delta_cp_avg_plot, sample_data = pheno_data_avg, column_color_by = color_variables, main = "qPCR")

For plotting purposes (boxplots), we set mean \(\Delta C_P\) of controls to \(0\) and calculate \(logFC\)s (\(\text{mean}(\Delta C_P,control) - \Delta C_P\)). \(logFC\) is also called \(\Delta \Delta C_P\).

(control_means <- cp_long_target %>%
  dplyr::filter(sample_group == "control") %>%
  dplyr::group_by(gene) %>%
  dplyr::summarise(delta_cp_control_mean = mean(delta_cp, na.rm = TRUE)))
cp_long_target <- cp_long_target %>%
  dplyr::left_join(control_means, by = "gene") %>%
  dplyr::mutate(
    delta_cp_centered = delta_cp_control_mean - delta_cp,
    delta_cp_centered_plot = replace_na(delta_cp_centered, ceiling(max(delta_cp_centered, na.rm = TRUE) + 1))
  ) %>%
  dplyr::select(sample_name, gene, matches("_cp|cp_"), everything())

head(cp_long_target)

Boxplots (TASK 5)

Boxplots are very informative when you want to see a difference between groups.

TASK 5: Implement a function to produce a boxplot of \(\Delta C_P\) values.

plot_boxplot_ggplot2(
  cp_long_target,
  x = "sample_group",
  y = "delta_cp_centered_plot",
  feature_col = "gene",
  color_by = "sample_group",
  main = "qPCR"
)

for (g in target_genes) {
  p1 <- plot_boxplot_ggplot2(
    cp_long_target,
    x = "sample_group",
    y = "delta_cp_centered_plot",
    feature_col = "gene",
    feature = g,
    main = "qPCR",
    color_by = "sample_group"
  )
  
  p2 <- plot_boxplot_ggplot2(
    cp_long_target,
    x = "sample_group",
    y = "delta_cp_centered_plot",
    feature_col = "gene",
    feature = g,
    main = "qPCR",
    color_by = "sample_group",
    facet_by = "confluence"
  )
  
  p1_p2 <- cowplot::plot_grid(p1, p2, ncol = 2)
  
  print(p1_p2)
}

t-test (TASK 6)

t-test is testing whether two means coming from data having Student’s distribution significantly differ.

TASK 6: Implement a function which do the t-test of gene from two groups.

for (g in target_genes) {
  test_results <- test_gene(
    gene = g,
    gene_data = cp_long_target,
    gene_col = "gene",
    value_col = "delta_cp_centered",
    group_col = "sample_group",
    test = t.test,
    verbose = TRUE
  )
  cat("\n")
}
## CD24: spirulina vs. control
## t.test p-value: 0.000959 ***
## 
## CD31: spirulina vs. control
## t.test p-value: 0.00535 **
## 
## CXCR4: spirulina vs. control
## t.test p-value: 0.168 NS
## 
## VEGFA: spirulina vs. control
## t.test p-value: 0.00234 **

Or you can return a table for all genes:

(test_table <- test_gene_table(
  gene_data = cp_long_target,
  gene_col = "gene",
  value_col = "delta_cp_centered",
  group_col = "sample_group"
))

Cleanup

save.image("qPCR.RData")

warnings()
traceback()
## No traceback available
sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
## 
## Matrix products: default
## BLAS:   /usr/lib/libblas/libblas.so.3.7.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.7.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] friendlyeval_0.1.1 gplots_3.0.3       ggplot2_3.2.1      rlist_0.4.6.1      RColorBrewer_1.1-2 magrittr_1.5       tidyr_1.0.2        stringr_1.4.0      dplyr_0.8.4       
## [10] glue_1.3.1         knitr_1.28         rmarkdown_2.1     
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.5.1      httr_1.4.1         jsonlite_1.6.1     viridisLite_0.3.0  foreach_1.4.8      gtools_3.8.1       shiny_1.4.0        assertthat_0.2.1   yaml_2.2.1        
## [10] ggrepel_0.8.1      pillar_1.4.3       digest_0.6.25      ggsignif_0.6.0     promises_1.1.0     colorspace_1.4-1   cowplot_1.0.0      htmltools_0.4.0    httpuv_1.5.2      
## [19] plyr_1.8.5         pkgconfig_2.0.3    ggcorrplot_0.1.3   pheatmap_1.0.12    xtable_1.8-4       purrr_0.3.3        scales_1.1.0       webshot_0.5.2      gdata_2.18.0      
## [28] later_1.0.0        tibble_2.1.3       farver_2.0.3       ggpubr_0.2.5       ellipsis_0.3.0     withr_2.1.2        lazyeval_0.2.2     crayon_1.3.4       mime_0.9          
## [37] heatmaply_1.0.0    evaluate_0.14      GGally_1.4.0       MASS_7.3-51.5      ggthemes_4.2.0     tools_3.6.2        registry_0.5-1     data.table_1.12.8  lifecycle_0.1.0   
## [46] plotly_4.9.2       munsell_0.5.0      cluster_2.1.0      packrat_0.5.0      compiler_3.6.2     caTools_1.18.0     rlang_0.4.4        grid_3.6.2         iterators_1.0.12  
## [55] htmlwidgets_1.5.1  crosstalk_1.0.0    bitops_1.0-6       labeling_0.3       gtable_0.3.0       codetools_0.2-16   reshape_0.8.8      TSP_1.1-9          reshape2_1.4.3    
## [64] R6_2.4.1           seriation_1.2-8    gridExtra_2.3      fastmap_1.0.1      KernSmooth_2.23-16 dendextend_1.13.3  stringi_1.4.6      Rcpp_1.0.3         vctrs_0.2.3       
## [73] gclus_1.3.2        tidyselect_1.0.0   xfun_0.12

HTML rendering

This chunk is not evaluated (eval = FALSE). Otherwise you will probably end up in recursive hell :)

library(rmarkdown)
library(knitr)
library(glue)

# You can set global chunk options. Options set in individual chunks will override this.
opts_chunk$set(warning = FALSE, message = FALSE)
render("qPCR.Rmd", output_file = "qPCR.html")